Abstract: Hybrid neuro-evolutionary algorithms may be
inspired by Darwinian or Lamarckian evolution. In the case
of Darwinian evolution, the Baldwin effect, that is, the
progressive incorporation of learned characteristics into
the genotypes, can be observed and leveraged to improve the
search.

The purpose of this paper is to carry out an exper- 
imental study into how learning can improve G-Prop 
genetic search. Two ways of combining learning and 
genetic search are explored: one exploits the Baldwin 
effect, while the other uses a Lamarckian strategy. 

Our experiments show that using a Lamarckian op- 
erator makes the algorithm find networks with a low 
error rate, and the smallest size, while using the Bald- 
win effect obtains MLPs with the smallest error rate, 
and a larger size, taking longer to reach a solution. 

Both approaches obtain a lower average error than
other BP-based algorithms like RPROP, other evolutionary
methods and fuzzy logic based methods.

Keywords: Evolutionary Algorithms, Generalization,
Learning, Neural Networks, Optimization,
Baldwin Effect, Lamarckian Search

I. Introduction and State of the Art 

Hybrid algorithms often implement non- 
Darwinian ideas, e.g. Lamarckian evolution or 
the Baldwin effect, where learning influences 
evolution. 

Lamarck's theory states that the characteristics
an individual acquires during its life are passed to
its offspring. Thus, the following generation will
inherit any acquired or learned characteristic, and this
mechanism would be responsible for the evolution of
species. According to this approach, learning has a
great influence on evolution, since all the learned
characteristics are passed on to the following generation.

Nevertheless, Baldwin [2] and Waddington [3] argued
that this influence is limited to the fact that the
individuals with greater learning capacity will adapt 
better to the environment, and thus will live longer. 
The longevity they acquire allows them to have more 
offspring through time, and propagate their abilities. 
As the number of offspring who have acquired the 
ability grows, this characteristic becomes part of the 
genetic code. 

These ideas have previously been used by numer- 
ous researchers in different approaches: 

• Lamarckian mechanisms in hybrid evolutionary
algorithms. Lamarckian theory is today totally
discredited from the biological point of view, but
it is possible to implement Lamarckian evolution
in EAs, so that an individual can modify its genetic
code during or after fitness evaluation (its
"lifetime"). These ideas have been used by several
researchers with particular success in problems
where the application of a local search operator
obtains a substantial improvement (travelling
salesman problem: Gorges-Schleuter [4], Merz and
Freisleben [5], Ross). In general, hybrid algorithms
are nowadays acknowledged as the best solution to a
wide array of optimization problems.

• Studying the Baldwin effect in hybrid algorithms.
Some authors have studied the Baldwin effect,
carrying out a local search on certain individuals
to improve their fitness without modifying the
genetic code of the individual. This is the strategy
proposed by Hinton and Nowlan in [7], who found that
learning alters the shape of the search space in
which evolution operates and that the Baldwin effect
allows learning organisms to evolve much faster than
their nonlearning equivalents, even though the
characteristics acquired by the phenotype are not
communicated to the genotype. Ackley and Littman
studied the Baldwin effect in an artificial life
system, and found that the experiments in which the
individuals had learning capabilities obtained the
best results. Boers et al. describe a hybrid
algorithm to evolve ANN architectures, whose
effectiveness is explained by the Baldwin effect,
implemented not as a process of learning in the
network, but as a change of the network architecture
as part of the learning process.

• Comparative studies of Lamarckian mechanisms
and the Baldwin effect in hybrid algorithms.
Some studies have investigated whether a strategy
based on a hybrid algorithm that takes advantage of
the Baldwin effect is better or worse than one
implementing Lamarckian mechanisms to accelerate the
search. The results obtained differ, and are very
dependent on the problem. Gruau and Whitley compared
Baldwinian, Lamarckian and Darwinian mechanisms
implemented in a genetic algorithm that evolves
ANNs, finding that the first and the second
strategies are equally effective for solving their
problem. Nevertheless, for another problem, the
results obtained by Whitley et al. [11] show that
taking advantage of the Baldwin effect can find the
global optimum, while a Lamarckian strategy,
although faster, usually converges to a local
optimum.
On the other hand, results obtained by Ku and Mak
with a GA designed to evolve recurrent neural
networks show that the use of a Lamarckian strategy
improves the algorithm, while the Baldwin effect does
not. Houck et al. study several algorithms and draw
similar conclusions, as does [17], where the
Darwinian, Baldwinian and Lamarckian mechanisms are
compared on the 4-cycle problem.

G-Prop (a genetic evolution of BP-trained MLPs),
used in this paper to tune learning parameters and
to set the initial weights and hidden layer size of an
MLP, searches for the optimal set of weights, the
optimal topology and learning parameters, using an EA
and Quick-Propagation (QP). In this method no
ANN parameters have to be set by hand; the EA
constants do have to be set, but the method is robust
enough to obtain good results under the default
parameter settings (all operators applied with the same
probability, 300 generations and 200 individuals in the
population).

This paper carries out a study of the Baldwin effect
in the G-Prop method when solving pattern
classification and function approximation problems.
We compare results with those of other authors, and
intend to check the results obtained by Gruau and
Whitley, i.e., that the use of learning that modifies
fitness without modifying the genetic code improves
the task of finding an ANN to solve the problem at hand.

We compare the results obtained taking advantage
of the Baldwin effect with those obtained using a
Lamarckian local search mechanism. We will also
compare them with other non-hybrid algorithms (RPROP
[23]), hybrid algorithms, and those based on fuzzy
logic, to show that both versions of G-Prop obtain
results better than (or at least comparable to) those
of other methods, although one of these versions is
more likely to be trapped at a local optimum because
it uses a local search genetic operator.

The remainder of this paper is structured as follows:
Section II presents the new fitness functions designed
to determine if the Baldwin effect takes place
in G-Prop. Section III describes the experiments,
Section IV presents the results obtained, followed by
a brief conclusion in Section V.

II. The G-Prop Algorithm 

In this section we will only describe the new fitness
functions designed to determine if the Baldwin effect
takes place in G-Prop. The complete description
of the method and results on classification problems
have been presented elsewhere [20].

In G-Prop, the Darwinian fitness function is given
by the classification / approximation ability obtained
when carrying out the validation after training, and
in the case of two individuals with identical ability
the best is the one whose hidden layer has fewer
neurons, which implies greater speed when training
and classifying and facilitates its hardware
implementation.

The classification accuracy is obtained by dividing
the number of hits by the total number of examples in
the validation set. The approximation ability is
obtained using the normalized mean squared error
(NMSE) given by:



$$\mathrm{NMSE} = \sqrt{\frac{\sum_{i} (s_i - o_i)^2}{\sum_{i} (s_i - \bar{s})^2}} \qquad (1)$$

where $s_i$ is the real output for example $i$, $o_i$ is
the obtained output, and $\bar{s}$ is the mean of all the
real outputs.
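
Both measures can be sketched in a few lines of Python. The
function names `hit_rate` and `nmse` are illustrative and not
part of G-Prop, and taking the square root of the error ratio
is an assumption about the intended normalization; drop
`math.sqrt` if the plain ratio is meant.

```python
import math
from typing import Sequence


def hit_rate(targets: Sequence[int], predictions: Sequence[int]) -> float:
    """Fraction of validation examples whose predicted class is correct."""
    if len(targets) != len(predictions) or len(targets) == 0:
        raise ValueError("targets and predictions must be non-empty and equal in length")
    hits = sum(1 for t, p in zip(targets, predictions) if t == p)
    return hits / len(targets)


def nmse(real: Sequence[float], obtained: Sequence[float]) -> float:
    """Normalized error: squared error divided by the spread of the real outputs.

    Assumption: the square root of the ratio is taken.
    """
    if len(real) != len(obtained) or len(real) == 0:
        raise ValueError("real and obtained must be non-empty and equal in length")
    mean_real = sum(real) / len(real)
    squared_error = sum((s - o) ** 2 for s, o in zip(real, obtained))
    spread = sum((s - mean_real) ** 2 for s in real)
    if spread == 0:
        raise ValueError("real outputs are constant, so the NMSE is undefined")
    return math.sqrt(squared_error / spread)
```

With this definition, a model that always outputs the mean of
the real outputs scores 1, so values below 1 indicate an
improvement over that trivial predictor.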

The Lamarckian approach uses no special fitness 
function; instead, a local search genetic operator (QP 
application) has been designed to improve the indi- 
viduals, saving the individual trained weights (ac- 
quired characteristics) back to the population. 
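
The write-back step that makes this operator Lamarckian can be
sketched as follows. This is a minimal illustration under
stated assumptions, not the G-Prop implementation: individuals
are plain weight lists, `toy_train` is a hypothetical stand-in
for QP training, and `rate` (the probability of applying the
operator) is a name introduced here.

```python
import random
from typing import Callable, List, Sequence

Weights = List[float]


def toy_train(weights: Sequence[float], epochs: int = 10, lr: float = 0.3) -> Weights:
    """Hypothetical stand-in for QP: gradient steps on sum((w - 1)^2)."""
    w = list(weights)
    for _ in range(epochs):
        w = [wi - lr * 2.0 * (wi - 1.0) for wi in w]
    return w


def lamarckian_operator(
    population: List[Weights],
    train: Callable[[Sequence[float]], Weights],
    rate: float,
    rng: random.Random,
) -> List[Weights]:
    """Train some individuals and store the trained weights as their genotype."""
    new_population = []
    for weights in population:
        if rng.random() < rate:
            # Acquired characteristics are written back to the population.
            new_population.append(train(weights))
        else:
            new_population.append(list(weights))
    return new_population
```

Because the trained weights replace the original ones, any
offspring produced afterwards start from the improved weights.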

On the other hand, the Baldwin effect requires 
some type of learning to be applied to the in- 
dividuals, and the changes (trained weights) are 
not codified back to the population. In order to 
take advantage of the Baldwin effect, the follow- 
ing fitness function is proposed: firstly the classi- 
fication/approximation ability on the validation set 
of the individual before being trained is calculated. 
Then it is trained and its ability (after training) is 
calculated. Three criteria are used to decide which is 
the best individual: the best MLP is that with higher 
classification/approximation ability after training; if 
both MLPs show the same accuracy, then the best 
is that whose ability before training is higher (the 
MLP is more likely to have a high accuracy when 
trained); if both MLPs show the same accuracy be- 
fore and after training, then the best is the smallest 
one. 
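
The three criteria amount to a lexicographic comparison, which
can be sketched as below. The class and function names are
illustrative, and "ability" is assumed to be a
higher-is-better score; when the measure is an error such as
the NMSE, its negative can be used instead.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class BaldwinFitness:
    ability_after: float   # ability on the validation set after training
    ability_before: float  # ability on the validation set before training
    size: int              # network size; smaller is better

    def key(self) -> Tuple[float, float, int]:
        # Tuples compare element by element, so later entries only
        # matter when all earlier ones are tied.
        return (self.ability_after, self.ability_before, -self.size)


def best_of(a: BaldwinFitness, b: BaldwinFitness) -> BaldwinFitness:
    """Return the better of two individuals under the three criteria."""
    return a if a.key() >= b.key() else b
```

The trained weights only feed into `ability_after`; the
individual's genetic code is left as it was before training.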

III. Experiments 

The algorithm was run for a fixed number of generations.
When training each individual of the population to
obtain its fitness, a limit on the number of epochs
was established. We used 300 generations and 200
individuals in the population in every run, and 200
training epochs, in order to avoid long simulation
times and overfitted networks, so that the EA carries
out the search and the training operator refines the
solutions. In addition, the number of epochs chosen
was much smaller than that necessary to train a single
MLP, so that the time taken to find a suitable network
to solve the problem is similar to that which would be
needed to train an MLP (that obtains similar results)
using a method based on gradient descent. After an
exhaustive test of genetic operators, we decided to
apply them with the same priority (see [20]). The
learning operator (see [21], [22]) was only used when
obtaining the results of the Lamarckian approach.

Some of the tests used to assess the accuracy of a
method (e.g. the exclusive-or problem) are not suitable
for testing certain capacities of the BP algorithm,
such as generalization [24]. Our opinion, along with
Prechelt, is that to test an algorithm, at least two
real-world problems should be used.

We have used a pattern classification problem
and a function approximation problem, in order to
demonstrate the capacity of the proposed method to
solve different kinds of problems, and also to show
that the Baldwin effect takes place whatever the
problem at hand is.

In these experiments, the Glass1a pattern
classification problem (extracted from the Proben1
data sets), proposed by Prechelt and used by Gronroos,
is used, as well as the function approximation
problem given by equation (2).

Glass1a is a problem of classification of glass
types, taken from the Proben1 data sets. The results of
chemical analysis of glass splinters (percent content
of 8 different elements) plus the refractive index are
used to classify the sample as either float processed
or non float processed building windows, vehicle
windows, containers, tableware, or head lamps. This
task is motivated by forensic needs in criminal
investigation. This dataset was created based on the
glass problem dataset from the UCI repository of
machine learning databases
(http://www.ics.uci.edu/~mlearn/MLRepository.html).
The data set contains 214 instances. Each sample
has 9 attributes plus the class attribute: refractive
index, sodium, magnesium, aluminium, silicon,
potassium, calcium, barium, iron, class attribute
(type of glass).
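
The layout of one sample can be written down as a small data
structure. This is a sketch under stated assumptions: the
identifiers are names introduced here, the six classes are
those read from the description above, and the 1-of-6 output
encoding is an assumption about how an MLP would be fed, not
something stated in the text.

```python
from typing import List

# The nine input attributes, in the order listed in the text.
ATTRIBUTES = (
    "refractive_index", "sodium", "magnesium", "aluminium", "silicon",
    "potassium", "calcium", "barium", "iron",
)

# The six glass types named in the description (hypothetical identifiers).
CLASSES = (
    "float_processed_building_windows",
    "non_float_processed_building_windows",
    "vehicle_windows",
    "containers",
    "tableware",
    "head_lamps",
)


def one_hot(glass_type: str) -> List[int]:
    """Encode a glass type as a 1-of-6 target vector (assumed encoding)."""
    if glass_type not in CLASSES:
        raise ValueError(f"unknown glass type: {glass_type}")
    return [1 if name == glass_type else 0 for name in CLASSES]
```

Under these assumptions an MLP for this problem has 9 inputs
and 6 outputs, and only the hidden layer size is left to the
search.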

The function given by equation (2) is an analytical
function gathered by Cherkassky and Sugeno [28],
and used by Pomares in their research on fuzzy
logic function approximation:

where $x \in [0, 1]$.



IV. Results 



Figure 1 shows average results over all the runs
for both Lamarckian and Baldwinian approaches to 
classification and approximation problems. Plotted 
data correspond to the best individual in the pop- 
ulation for each generation. The dotted line corre- 
sponds to the classification / approximation ability 
before training, while the dashed line corresponds to 
the classification / approximation ability after train- 
ing, in the Baldwinian approach. The solid line cor- 
responds to the Lamarckian approach. 

Standard deviation values have not been plotted 
due to the fact that those values remain constant 
along the generations; in any case, they show that 
error achieved is similar. 

[Figure: Average results over all the runs for both
Lamarckian and Baldwinian approaches for the Glass
problem. Average error is plotted above (on a vertical
log scale) and size below.]

Using Lamarckian evolution, a suitable MLP is
found in the early generations of the simulation.
However, this MLP remains the best for the rest of the
simulation (evolution stops) because of the use of an
"elitist" algorithm, and tends to dominate the
population due to its high fitness.
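
The effect of elitism described here can be illustrated with a
minimal replacement scheme. This is only a sketch: the
replacement policy actually used by G-Prop is not described in
this section, and `elitist_replacement` and `n_elite` are names
introduced for the example. It assumes at least
`len(population) - n_elite` offspring are available.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def elitist_replacement(
    population: List[T],
    offspring: List[T],
    fitness: Callable[[T], float],
    n_elite: int = 1,
) -> List[T]:
    """Keep the n_elite fittest parents and fill the rest with the best offspring."""
    if not 0 <= n_elite <= len(population):
        raise ValueError("n_elite must be between 0 and the population size")
    elite = sorted(population, key=fitness, reverse=True)[:n_elite]
    n_rest = len(population) - n_elite
    rest = sorted(offspring, key=fitness, reverse=True)[:n_rest]
    return elite + rest


# Toy usage: individuals are their own fitness values.
parents = [0.9, 0.2, 0.3]
children = [0.5, 0.4, 0.1]
survivors = elitist_replacement(parents, children, fitness=lambda x: x)
```

Since the fittest parent always survives, an individual that
has been improved by local search early on stays in the
population until something better appears.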

On the other hand, using the Baldwin effect, re- 
sults can be as good as using Lamarckian evolution, 
although the method needs many more generations 
and the evolution of the population is much more 
progressive during the simulation. Results in size 
show that using the Lamarckian approach, MLPs are 
smaller than with the Baldwinian approach. 

The method exhibits roughly the same behaviour
on the function approximation problem.

Although it is not the aim of this paper to compare
G-Prop with other methods, we do so in order to show
the capacity of both versions of G-Prop to solve
pattern classification and function approximation
problems, and how they outperform other methods.

Tables I and II show the average error rate, the
average size of nets as the number of parameters, 
that is, the number of weights of the net, and the 
average number of generations until the best one for 
that run is found. 

The results for the Glass1a pattern classification
problem (% of error in test) obtained using the
Lamarckian mechanism are compared with those
obtained taking advantage of the Baldwin effect and
those obtained by Prechelt (using RPROP [30]) and
Gronroos (using a hybrid algorithm) in Table I.



Approach      Error (%)    Size        Generations

Lamarckian    32 ± 2       59 ± 28     52 ± 54

Baldwinian    31 ± 2       112 ± 62    119 ± 55

Prechelt      33 ± 5       350

Gronroos      32 ± 5       350

TABLE I

Results for the Glass1a problem obtained with G-Prop
taking advantage of the Baldwin effect and for the
Lamarckian approach, as well as those obtained by Prechelt
and Gronroos, which are included for the sake of comparison.



G-Prop matches or outperforms the other methods in
classification error and obtains much smaller
networks: Prechelt, using RPROP, obtained a
classification error of 33 ± 5, and Gronroos, using
Kitano's network, obtained 32 ± 5, while G-Prop
achieves an error of 32 ± 2 using the Lamarckian
approach and 31 ± 2 taking advantage of the Baldwin
effect.

In the case of the configuration that verifies if the 
Baldwin effect takes place in G-Prop, the classifica- 
tion ability obtained is greater, although the size is 
greater and more generations are needed to reach 
similar results. 

The results for the function approximation
problem, obtained using the Lamarckian mechanism,
are compared with those obtained taking advantage
of the Baldwin effect and those obtained by Pomares
in Table II.



Approach      Error            Size           Generations

Lamarckian    0.09 ± 0.01      18 ± 8         34 ± 28

Baldwinian    0.086 ± 0.004    85 ± 27        97 ± 81

Pomares       0.125            6 (4 rules)

TABLE II

Results for the function approximation problem obtained
with G-Prop taking advantage of the Baldwin effect and for
the Lamarckian approach, as well as those obtained by
Pomares, which are included for the sake of comparison.



The proposed method obtains better results on 
approximation ability (0.09 ± 0.01 versus 0.125) al- 
though the networks obtained are greater in size 
and number of parameters than those obtained using 
fuzzy controllers (18 ± 8 versus 6). 

The approximation ability obtained is greater using
the configuration that verifies if the Baldwin effect
takes place in G-Prop, while the networks are
considerably larger (85 ± 27 versus 18 ± 8 parameters)
and more generations are needed to reach similar
results.
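
The figures in Table II can be put in relative terms with a
little arithmetic. The snippet below is only a worked example
on the mean values reported in the table; it ignores the
reported deviations, and `relative_reduction` is a helper
introduced here.

```python
def relative_reduction(baseline: float, value: float) -> float:
    """Fraction by which `value` improves on `baseline` (lower is better)."""
    return (baseline - value) / baseline


POMARES_ERROR = 0.125      # mean error reported for Pomares in Table II
LAMARCKIAN_ERROR = 0.09    # mean error, Lamarckian approach
BALDWINIAN_ERROR = 0.086   # mean error, Baldwinian approach

lamarckian_gain = relative_reduction(POMARES_ERROR, LAMARCKIAN_ERROR)
baldwinian_gain = relative_reduction(POMARES_ERROR, BALDWINIAN_ERROR)

# Ratio of mean network sizes (Baldwinian over Lamarckian).
size_ratio = 85 / 18

print(f"Lamarckian: {lamarckian_gain:.1%} lower error than the fuzzy approach")
print(f"Baldwinian: {baldwinian_gain:.1%} lower error than the fuzzy approach")
print(f"Baldwinian networks are about {size_ratio:.1f} times larger")
```

On these means, the error is roughly 28% and 31% lower than
that of the fuzzy approach, while the Baldwinian networks have
between four and five times as many parameters as the
Lamarckian ones.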

Each run of the proposed method takes about 4
hours on an AMD-K7(tm) 600 MHz, using the
parameters described above.

The Lamarckian strategy achieves good enough re- 
sults using, on average, fewer generations, although 
the Baldwinian strategy, using a suitable number of 
generations, can achieve the same or even better re- 
sults. 

The results obtained show that if the problem does 
not have many local minima and results must be ob- 
tained quickly, the best strategy is the Lamarckian. 
Otherwise, the Baldwinian strategy or a mixture of 
both is the best. 

V. Conclusions 

A study of the Baldwin effect in the G-Prop
method (a hybrid algorithm to tune learning
parameters, initial weights and hidden layer size of
an MLP using an EA and QP) has been carried out. A
comparison between the results obtained taking
advantage of the Baldwin effect and those obtained
using a local search Lamarckian mechanism has been
made.

The results obtained agree with those presented
by Whitley et al. [11], and show that the use of a
Lamarckian strategy makes the method obtain good
solutions faster than if the Baldwin effect is used,
although it is more likely to be trapped in a local
optimum than the approach that takes advantage of
the Baldwin effect. However, errors are not
significantly worse.

Figures show how a Lamarckian strategy finds a 
suitable MLP in the early generations which remains 
the best during the simulation (evolution stops); 
with the Baldwin effect, results can be as good as 
those of Lamarckian evolution, although the method 
needs many more generations and the evolution is 
much more progressive. 

It should also be observed that neural nets obtained
using a Lamarckian strategy are smaller, which
contributes to learning speed. Besides, a small
network is fast while training and classifying, and
obtaining it in fewer generations means less time is
needed to design it.

Another interesting result is that when the Lamarckian
training operator is used, learning contributes
more to fitness improvement at the beginning of the
simulations [31]. This is due to the use of an elitist
algorithm: when the operator is applied to some MLPs,
these individuals obtain an advantage in relation
to the rest of the population and continue to be the
best individuals among the population until the end of
the simulation. This can also be shown using
visualization techniques [32].